1.  Introduction


The replication scheme of a distributed database determines how many replicas of each object are created, and
to which processors these replicas are allocated. This scheme critically affects the performance of a distributed
system, since reading an object locally is less costly than reading it from a remote processor. Therefore, in a
read-intensive network a widely distributed replication is mandated. On the other hand, an update of an object is
usually written to all, or a majority, of the replicas, and therefore in a write-intensive network a narrowly
distributed replication is mandated. In other words, the optimal replication scheme depends on the read-write
pattern for each object.

Currently, the replication scheme is established in a static fashion, when the distributed database is designed,
and it remains fixed until the designer manually intervenes to change the number of replicas or their location. If the
read-write patterns are fixed and are known a priori, this is a reasonable solution. However, if the read-write patterns
change dynamically, in unpredictable ways, a static replication scheme may lead to severe performance problems.

In this paper we propose a practical algorithm, called Dynamic-Data-Allocation (DDA), that changes the
replication scheme of an object (i.e., the processors which store a replica of the object) dynamically as the
read-write pattern of the object changes in the network. We assume that the changes in the read-write pattern are
not known a priori.

The DDA algorithm is based on the primary-copy method and preserves 1-copy-serializability. It is distributed
in the sense that the decision on whether or not to store a copy of the object is made by each individual processor
based on locally collected statistics. The algorithm changes the replication scheme of an object to optimize a
global cost function for the current read-write pattern. As the read-write pattern changes, so does the replication
scheme. Specifically, if during a period of time it is cost effective for some processor to store a copy of the object
(due to the dominant cost of reads initiated locally), then it will do so. When the processor determines that, due to the
dominant cost of writes initiated at other processors, it is not cost effective to retain the copy, then it will give it up.

The DDA algorithm is also integrated. In such an algorithm, redistribution of the replicas is integrated into the
processing of reads and writes, instead of executing independently of these operations. So, for example, a processor
x creates a replica of an object at some neighbor, y, in response to y's request to read the object from x. x creates the
replica by piggybacking, on the message to y containing the object, an indication that y should keep a replica (and
that x will propagate writes to y). Additionally, we show that the DDA algorithm can cope with failures.


2.  The  Cost  Function 


Consider a data object O in a database that is replicated over a set of processors interconnected by a
communication network. The set of processors at which O is allocated is called the O-scheme. Transactions executing
in the distributed system issue logical reads and logical writes of O. Each logical read is translated into a read of a
physical replica of O, and each logical write is translated into the write of all physical replicas of O. A logical
read (write) will be called a read (write) for short.

Suppose that the cost of a read of O executed at a processor that has a copy of O is one (the I/O
cost). The cost of a read executed at a processor that does not have a copy of the object is 1 + d, where d is
a constant representing the communication cost of transmitting the object from one processor to another. The cost of a
write is proportional to the number of copies, c. Specifically, for each copy there is a cost of transmission (of the
object to the processor) and an I/O cost (of writing the object). Therefore, the cost of a write is $c \cdot (1 + d)$.¹

¹ Strictly speaking, a write from a processor that stores a copy of O costs only $c \cdot (1 + d) - d$, since the
communication cost, d, is not incurred for the local copy. We ignore this point in the present paper, and merely note that
all the statements made in the rest of the paper can be easily adapted to the more complicated cost function.
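
The per-operation costs above can be written down directly. The following is only an illustrative sketch: the function names and the `exact` flag are assumptions introduced here, not part of the paper. With `exact=True` the write cost follows the footnote's refinement; otherwise it is the simplified form used in the rest of the text.

```python
def read_cost(d: float, has_local_copy: bool) -> float:
    """Cost of one read: 1 (I/O) locally, 1 + d when the object is fetched remotely."""
    return 1.0 if has_local_copy else 1.0 + d


def write_cost(d: float, copies: int, writer_has_copy: bool = False,
               exact: bool = False) -> float:
    """Cost of one write to `copies` replicas.

    With exact=False this is the simplified c * (1 + d) used in the text.
    With exact=True the footnote's refinement applies: a writer that holds
    a copy does not pay the transmission cost d for its own replica.
    """
    cost = copies * (1.0 + d)
    if exact and writer_has_copy:
        cost -= d
    return cost
```
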
Consider a distributed database system in which a processor i performs $\#W_i$ writes of O and $\#R_i$ reads of O
during each time unit. (For example, the time unit may be three minutes.) Then the total cost of reading and writing
O during a time unit is a function of the O-scheme. It is straightforward to show that this cost function is
minimum when the O-scheme consists of the set of processors

$$\{\, k \mid (1 + d) \cdot \sum_j \#W_j \le d \cdot \#R_k \,\}.$$

In other words, a copy is allocated to a processor k if and only if 1 + d times the total number of writes in the
network during a time unit is not higher than d times the number of reads issued by k during a time unit. Intuitively,
the reason for this being the optimal allocation scheme is the following. Allocating the copy to a processor k
increases the cost of each write of O (issued by any processor in the network) by 1 + d, and it decreases the cost of
each read issued by k by d (since k saves the communication cost of remotely reading O). Therefore, a copy of O should
be placed at each processor for which the increase in cost is not higher than the decrease in cost.
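
The optimality claim can be checked numerically on a small instance. The sketch below is illustrative only: the function names, the dictionary representation of the read and write counts, and the restriction to non-empty schemes in the brute-force search (so that O always has at least one copy) are assumptions made here. It computes the cost of a time unit for a given scheme, builds the scheme from the rule stated above, and compares it with an exhaustive search.

```python
from itertools import combinations


def total_cost(scheme, reads, writes, d):
    """Cost of one time unit for a given O-scheme (a set of processor ids)."""
    c = len(scheme)
    # Every write in the network costs c * (1 + d).
    write_part = sum(writes.values()) * c * (1 + d)
    # A read costs 1 at a replica holder and 1 + d elsewhere.
    read_part = sum(n * (1 if k in scheme else 1 + d) for k, n in reads.items())
    return write_part + read_part


def optimal_scheme(reads, writes, d):
    """Processors k with (1 + d) * (total writes) <= d * reads[k]."""
    total_writes = sum(writes.values())
    return {k for k, n in reads.items() if (1 + d) * total_writes <= d * n}


def brute_force_minimum(reads, writes, d):
    """Smallest cost over every non-empty subset of processors."""
    procs = list(reads)
    return min(
        total_cost(set(s), reads, writes, d)
        for r in range(1, len(procs) + 1)
        for s in combinations(procs, r)
    )


# Hypothetical workload: reads and writes per processor in one time unit.
reads = {"a": 40, "b": 5, "c": 20}
writes = {"a": 2, "b": 3, "c": 1}
d = 2
best = optimal_scheme(reads, writes, d)
```

In this example the rule selects processors `a` and `c`, and the cost of that scheme equals the brute-force minimum. Processor `b` is left out because its 5 reads save only 10 cost units, while each extra copy adds 18 units of write cost.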

3.  The  DDA  Algorithm 

The optimal replication scheme defined at the end of the last section optimizes the cost of accessing an object
O under the assumption that the read-write pattern, i.e., $\#W_j$ and $\#R_j$ for each j, is fixed; viz., in each time
unit a processor j performs $\#W_j$ writes and $\#R_j$ reads. In practice, the read-write pattern may change over time.
The DDA algorithm presented in this section adapts the O-scheme to optimize the cost function for the current time unit.

The DDA algorithm consists of two parts: first, the execution of reads and writes of O, and second, the changing
of the O-scheme. We first describe the read and write algorithm; it is based on the primary-copy method. At any
point in time there is a processor, called the primary-copy processor and denoted P(O), that stores the primary copy
of O. P(O) services the reads of processors that do not have a copy of O, as well as the writes of O. Processor P(O)
knows which other processors have copies of O. A processor that does not have a copy of O forwards read and write
requests for O to P(O). Write requests are forwarded by all the processors to P(O), which in turn propagates them to
all the processors of the O-scheme (except for the processor that issued the write).
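
The read and write routing just described can be sketched as a small in-memory simulation. This is an assumption-laden illustration, not the paper's implementation: the class `PrimaryCopy`, its methods, and the dictionary standing in for the physical replicas are all hypothetical, and messaging, locking, and failures are left out.

```python
class PrimaryCopy:
    """State kept by P(O): the current O-scheme and simulated replica contents."""

    def __init__(self, primary, value):
        self.primary = primary
        self.scheme = {primary}          # P(O) always holds a copy
        self.copies = {primary: value}   # processor id -> replica value

    def join(self, proc):
        """Add proc to the O-scheme, giving it the primary's current value."""
        self.scheme.add(proc)
        self.copies[proc] = self.copies[self.primary]

    def read(self, issuer):
        """Return (value, served_by); a replica holder reads its own copy."""
        if issuer in self.scheme:
            return self.copies[issuer], issuer
        return self.copies[self.primary], self.primary

    def write(self, issuer, value):
        """Apply a write; return the replica holders updated on the issuer's behalf."""
        targets = sorted(p for p in self.scheme if p != issuer)
        for p in targets:
            self.copies[p] = value
        if issuer in self.scheme:
            self.copies[issuer] = value  # the issuer updates its own copy directly
        return targets
```

For instance, with primary `p` and a second replica at `a`, a write issued by `a` is applied at `p` only, while a write issued by a processor outside the scheme is applied at both `a` and `p`.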

Changing the O-scheme (the second part of DDA) consists of two comparisons, called exit (from the O-scheme) and
enter. The exit comparison is executed by every processor of the O-scheme, and the enter comparison is executed by
P(O).

Now we elaborate upon these two comparisons. For each write received by a processor j of the O-scheme, j
compares two values that were determined by the fact that j kept a copy of O during the last time unit. One value is
the increase in the cost of writes, and the other is the decrease in the cost of reads. If the increase is higher than the
decrease, then j relinquishes its copy. Specifically, when a write is received by j, it performs the following
comparison.

(Exit) If during the time unit that ends at the present time $(1 + d) \cdot \sum_i \#W_i > d \cdot \#R_j$, then j
relinquishes its copy.
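
A minimal sketch of how a replica holder j might evaluate the exit comparison from locally collected statistics follows. The sliding-window counter, its name `WindowCounter`, and the use of numeric timestamps in seconds are assumptions for illustration; the paper only specifies that the counts cover the time unit ending at the present time. Since every write is propagated to j while it holds a copy, j can count all writes in the network locally.

```python
from collections import deque


class WindowCounter:
    """Counts events whose timestamps fall within the last `time_unit`."""

    def __init__(self, time_unit):
        self.time_unit = time_unit
        self.stamps = deque()

    def record(self, now):
        self.stamps.append(now)

    def count(self, now):
        # Drop events that are at least one full time unit old.
        while self.stamps and self.stamps[0] <= now - self.time_unit:
            self.stamps.popleft()
        return len(self.stamps)


def should_exit(total_writes, local_reads, d):
    """Exit test: (1 + d) * (writes seen by j) > d * (reads issued by j)."""
    return (1 + d) * total_writes > d * local_reads
```

A processor would call `record` on one counter for each write it issues or receives and on another for each read it issues, then call `should_exit` with both counts whenever a write arrives.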

The procedure for relinquishing the copy depends on the processor. Specifically, P(O) relinquishes the copy
differently than another processor j of the O-scheme. j relinquishes the copy by simply requesting P(O) not to
propagate any more writes to j. Further read requests of j will be sent to P(O). In contrast, P(O) relinquishes the copy
by selecting another processor of the O-scheme, say i, and designating i as the new primary-copy processor. If P(O)
is the only processor in the O-scheme, then the new primary-copy processor i is the one that issued the maximum
number of reads during the time unit. Designating a new primary-copy processor involves performing the following
three functions:

1. P(O) informs i of the designation, and sends to i the identification of the processors comprising
the current O-scheme.
2. P(O) broadcasts to all the processors in the network the fact that i is the new primary-copy
processor.
3. If subsequently (the old) P(O) receives read and write requests (because, for example, these may
already have been on their way at the time the primary-copy processor switch was announced), then it simply
propagates them to the new P(O), i.e., to i.
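
The three handover functions can be sketched as message construction. Several details below are assumptions, since the text leaves them open: the choice of successor among the other replica holders (here simply the smallest identifier), the tie-breaking among readers, the tuple format of messages, and the decision to send i the scheme with the old primary already removed. When the old primary was the only replica holder, the designation message would also have to carry the object itself; that transfer is not modeled.

```python
def choose_successor(primary, scheme, reads_in_window, all_processors):
    """Pick the processor i that takes over from the old P(O)."""
    others = sorted(p for p in scheme if p != primary)
    if others:
        return others[0]  # assumption: any other replica holder is acceptable
    # P(O) held the only copy: take the processor with the most reads.
    candidates = sorted(p for p in all_processors if p != primary)
    return max(candidates, key=lambda p: reads_in_window.get(p, 0))


def hand_over(primary, scheme, reads_in_window, all_processors):
    """Return the new primary, its O-scheme, and the messages the old P(O) sends."""
    new_primary = choose_successor(primary, scheme, reads_in_window, all_processors)
    new_scheme = (set(scheme) - {primary}) | {new_primary}
    # Function 1: inform i and send it the scheme membership.
    messages = [(new_primary, ("designate", sorted(new_scheme)))]
    # Function 2: announce the new primary to everyone else.
    messages += [(p, ("new_primary", new_primary))
                 for p in sorted(all_processors)
                 if p not in (primary, new_primary)]
    return new_primary, new_scheme, messages


def forward_stale(request, new_primary):
    """Function 3: the old P(O) relays requests that were already in flight."""
    return (new_primary, request)
```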

The enter comparison is performed by P(O) upon the receipt of a read request from a processor j that does not
have a copy of O. P(O) compares two values that were determined by the fact that j did not keep a copy of O during
the last time unit. One value is the saving in the cost of writes, and the other is the penalty in the form of
increased cost of reads from j. If the increased cost is not lower than the saving, then P(O) allocates a copy of O
at j. Specifically, when P(O) receives a read request from j, it performs the following comparison.

(Enter) If during the time unit that ends at the present time $(1 + d) \cdot \sum_i \#W_i \le d \cdot \#R_j$, then
P(O) tells j to join the O-scheme (by saving a copy of O; P(O) commits to propagate to j further writes of O).
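
The enter comparison, together with the piggybacking mentioned in the introduction, might look as follows. This is a sketch under assumptions: the reply is modeled as a dictionary with hypothetical field names, and the counts for the current time unit are passed in rather than collected. P(O) can maintain both counts itself, because all writes and all reads from non-replica processors pass through it.

```python
def should_enter(total_writes, reads_from_j, d):
    """Enter test: (1 + d) * (total writes) <= d * (reads issued by j)."""
    return (1 + d) * total_writes <= d * reads_from_j


def serve_remote_read(j, value, scheme, total_writes, reads_from_j, d):
    """Build P(O)'s reply to j's read; the join indication rides on the reply."""
    join = should_enter(total_writes, reads_from_j, d)
    if join:
        scheme.add(j)  # from now on P(O) propagates writes to j
    return {"to": j, "object": value, "keep_replica": join}
```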

4.  Practical  Considerations  for  the  DDA  Algorithm 

In the DDA algorithm, concurrency control can be performed as follows. Each write exclusively locks all the
copies of O, and each read share-locks the primary copy (at P(O)) or the local copy, depending on the processor
which performs the read. It is easy to see that two-phase locking combined with this scheme ensures
1-copy-serializability.
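
The lock sets implied by this scheme can be stated compactly. The sketch below only computes which locks an operation must acquire; the function name and the "S"/"X" labels for shared and exclusive modes are assumptions, and the two-phase acquisition and release protocol itself is not modeled.

```python
def locks_needed(op, issuer, scheme, primary):
    """Map each processor whose copy must be locked to the lock mode required."""
    if op == "write":
        # A write takes an exclusive lock on every copy of O.
        return {p: "X" for p in scheme}
    # A read takes a shared lock on the local copy if there is one,
    # otherwise on the primary copy at P(O).
    target = issuer if issuer in scheme else primary
    return {target: "S"}
```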

Since the number of copies of O varies in time, the dynamic allocation algorithm is vulnerable to failures that
may render O inaccessible. To address this problem, the user may impose reliability constraints of the following
form: “The number of copies cannot decrease below a given threshold.” If such a constraint is present, then P(O)
refuses to accept the exit of an O-scheme processor if that exit would shrink the O-scheme below the threshold. In
other words, P(O) informs a processor j that requests to relinquish its copy that the request is denied; subsequently,
writes continue to be propagated to j. j continues to reissue the request whenever the exit comparison dictates
doing so. The request may be granted later on, if the O-scheme has expanded in the meantime.
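
The threshold rule amounts to one extra check at P(O) before an exit is accepted. In this sketch the name `handle_exit_request` and the parameter `min_copies` (standing for the user-supplied threshold) are hypothetical.

```python
def handle_exit_request(scheme, j, min_copies):
    """Grant j's exit only if the O-scheme keeps at least min_copies members.

    Returns True and removes j when the request is granted; returns False
    (leaving the scheme unchanged, so writes keep flowing to j) when denied.
    """
    if j not in scheme:
        return False  # nothing to relinquish
    if len(scheme) - 1 < min_copies:
        return False
    scheme.discard(j)
    return True
```

A denied processor simply retries on a later write; if another processor has entered the scheme in the meantime, the same call then succeeds.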

Failure of any processor, except P(O), does not affect the execution of reads and writes at the other processors.
If P(O) fails and there are no other processors in the O-scheme, then O will be unavailable until P(O) recovers.
Otherwise, the remaining processors will execute an election protocol to select the new P(O). In case of network
partition, the partition containing the primary-copy processor will continue to process reads and writes, and all the
others will not do so.

Notice that in the DDA algorithm, the user may change the sensitivity of the algorithm to variations in the
read/write pattern by changing the length of the time unit. The shorter the time unit, the more sensitive the
algorithm. On the other hand, shorter time units induce more frequent changes of the O-scheme.
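
The effect of the time-unit length can be illustrated by replaying one event trace under two window lengths. Everything in this sketch is hypothetical: the trace, the value of d, and the helper names. It reports the time of the first write at which the exit comparison would fire for a replica holder whose workload switches from read-only to write-only.

```python
def count_in_window(stamps, now, time_unit):
    """Number of events with timestamp in the interval (now - time_unit, now]."""
    return sum(1 for s in stamps if now - time_unit < s <= now)


def first_exit_time(read_times, write_times, d, time_unit):
    """Time of the first write at which (1 + d) * writes > d * reads, or None."""
    for now in sorted(write_times):
        w = count_in_window(write_times, now, time_unit)
        r = count_in_window(read_times, now, time_unit)
        if (1 + d) * w > d * r:
            return now
    return None


# One read per second for a minute, then one write every two seconds.
read_times = list(range(0, 60))
write_times = list(range(60, 141, 2))
short = first_exit_time(read_times, write_times, d=1, time_unit=30)
long = first_exit_time(read_times, write_times, d=1, time_unit=120)
```

With the 30-second time unit the copy is given up well before it is with the 120-second time unit, because the older reads age out of the window sooner.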

5.  Relevant  Work 

Many performance-oriented works on replicated data consider the static problem of replication, namely
establishing a priori a replication scheme that will optimize performance, but will remain fixed at runtime. This is
called the file-allocation problem, and it has been studied extensively in the literature (see [DF] for a survey).
Another approach to improving the performance of a replicated distributed database is to relax the serializability
requirement. Works on quasi-copies ([ABG1, ABG2, BG]), lazy replication ([LLS]), and bounded ignorance ([KB]) fall in
this category. These works also assume a static replication scheme. In contrast, our approach preserves
one-copy-serializability, since the DDA algorithm is read-one-write-all (although the meaning of “all” changes
dynamically).

In the theoretical computer science community there has been work on online algorithms (e.g. [BLS]),
particularly for paging (e.g. [BS, CL, ST]), searching (e.g. [ST]) and caching (e.g. [KMRS]). These works are similar in
spirit to the DDA algorithm; however, the models in such works are inappropriate for managing replication in
distributed databases. For example, the cost of I/O, which is considered by the DDA algorithm, is not a factor in
caching; and the replacement issue (i.e., which page to replace), which is important in paging, usually is not a factor
in replicated data management.

Finally, in [WJ] we proposed algorithms for dynamic replication. However, these algorithms depended
on the network having a tree topology, a limitation removed by the DDA algorithm; also, the [WJ] algorithms
ignored the I/O cost of replication, and they were not amenable to changing the frequency of O-scheme adaptation.